Bulk RNA sequencing data on 24 pre-treatment breast cancer tumours
performed by Wu et al. [1] was downloaded from
GEO with accession GSE176078,
and associated
publication.
geo_id <- "GSE176078"
This dataset was sequenced using Illumina NextSeq 500 (Homo sapiens), submitted on Apr 15 2014.
Breast cancers are clinically stratified based on:
This results in the following three clinical subtypes within this dataset:
Breast cancers are also stratified on bulk transcriptomic profiling via PAM50 gene signatures [2] describing five molecular subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like.
The ~70-80% concordance between clinical and molecular subtypes motivated this study to improve functional understanding of breast cancer in a broader and more coherent sense.
For this study, the clinical subtype conditions classified are as follows:
HER2+ ; (HER2+, ER-,
PR+/-)HER2+/ER+ ; (HER2+, ER+,
PR+/-)ER+ ; (HER2-, ER+,
PR+/-)TNBC ; (HER2-, ER-,
PR-)The distribution of clinical subtypes across the 24 samples is shown in table 1.
| Count | Percent | |
|---|---|---|
| HER2+ | 2 | 8.33 |
| HER2+/ER+ | 2 | 8.33 |
| TNBC | 8 | 33.33 |
| ER+ | 12 | 50.00 |
| Total | 24 | 100.00 |
This leads to 16.6666667% of samples containing HER2
expression, and 41.6666667% of samples containing ER
expression, with the overlaps of these representing 8.3333333% of all
samples; 50% of those with HER2, 14.2857143% of those with
ER.
The raw counts were cleaned, mapped to HUGO Gene Nomenclature Committee (HGNC) Symbols, and normalized to produce final counts which will be used for the duration of this report.
Of the original 58387 genes, we were able to map and
produce 28712 unversioned, unique genes, of which filtering
outliers (keeping genes present in a minimum of 12 samples)
resulted in 14800 genes remaining. This results in the
following plots.
The density plot shows the distribution of the cleaned, filtered, and normalized counts per million across all samples (varying colours). It displays a smooth curve as the result of our pre-processing.
Dispersion was calculated using edgeR to describe
deviation of variance from the mean. The Biological Coefficient of
Variation (BCV) is dispersion-squared, and represents the mean-variance
relationship among genes.
The normalized counts initialized above are formatted as shown by the subsection displayed in table 2.
| CID3586 | CID3921 | CID3941 | CID3948 | CID3963 | |
|---|---|---|---|---|---|
| WASH7P | 17.5790 | 1.0450 | 14.7274 | 0.0000 | 28.5314 |
| CICP27 | 1.9276 | 3.6553 | 2.2895 | 0.0000 | 2.1353 |
| MTND2P28 | 175.8840 | 174.5844 | 168.6025 | 354.9525 | 450.4684 |
| MTATP6P1 | 413.8903 | 589.8450 | 462.3509 | 1053.1526 | 404.6456 |
| FAM87B | 2.6751 | 1.9207 | 2.7736 | 1.5650 | 2.4742 |
| LINC01128 | 26.6312 | 35.3648 | 17.6844 | 20.9555 | 21.8894 |
Given our normalized expression dataset contains only grouped
classification via clinical subtypes, these are the obvious choice for
factors. But, due to the nature of these classifiers, the names
themselves represent binary pairing of HER2 and
ER expression and absence (in a clinical setting). This
provides some flexibility for analyzing differential expression since we
really have two protein expressions, with their binary pairing as the
clinical subtype classification.
The Multi-Dimensional Scaling (MDS) plot was produced using
limma on the normalized data, and it shows the distance
between samples factored by the four distinct clinical subtypes.
Figure 3: Multi-Dimensional Scaling plot for clinical subtypes of breast cancer as classifications of overexpression, or lack thereof, of ER and HER2
In the given MDS plot, notice that blue and
red are opposites (+/- and -/+),
purple is the intermediary between blue and
red (+/+), whereas green
represents TNBC: double negative (-/-).
Although progesterone is a marker for breast cancer, its expression is
not quantifiably present within the dataset, and so is not included.
However, TNBC implies PR is not overly
expressed as well.
In terms of these colours in relation to each other, you can see some clustering appearing, although nothing is confidently distinct.
I redefine the model to include the clinical subtypes of breast
cancer as factors, both as four separate classifiers, and the binary
pairing of over/under expression of HER2 and
ER.
# aggregate model design
model_design <- model.matrix(~ types_df$subtype)
# split model design (binary pairs)
splt_model_design <- model.matrix(~ types_df$her2 + types_df$er)
Given both design choices, I used edgeR’s
Quasi likelihood model to create quasi-likelihood fits.
# normalized counts grouped by clinical subtype
d <- DGEList(counts=norm_matrix,
group=types_df$subtype)
# estimate dispersion on subtype model design
d_ <- estimateDisp(d, model_design)
qlfit <- glmQLFit(d_, model_design)
# in parallel, estimate dispersion on binary pairing model design
d_splt <- estimateDisp(d, splt_model_design)
qlfit_splt <- glmQLFit(d_splt, splt_model_design)
Now, we can use glmQLFTest to calculate the differential
expression according to this model.
# fit subtype model design
qlf.subtypes <- glmQLFTest(qlfit)
# fit split model design
qlf.her2p <- glmQLFTest(qlfit_splt,
coef='types_df$erTRUE')
qlf.erp <- glmQLFTest(qlfit_splt,
coef='types_df$her2TRUE')
At this point, we analyze the calculated p-values and aggregate our top hits for each model.
# top tags for aggregate model design
top_hits <- topTags(qlf.subtypes,
sort.by = "PValue",
n = nrow(normalized_counts))
# tog tags for split model design
top_her2p <- topTags(qlf.her2p,
sort.by = "PValue",
n = nrow(normalized_counts))
top_erp <- topTags(qlf.erp,
sort.by = "PValue",
n = nrow(normalized_counts))
For our three model designs (one aggregate, and two binary splits), the following number of genes were significantly differentially expressed before correction.
| Significantly Differentially Expressed Genes | |
|---|---|
| Aggregate across subtypes | 1462 |
| HER2 expression | 1240 |
| ER expression | 779 |
A threshold of 0.05 for p-values is very standard. It
implies that statistically we will only achieve this outcome
5% of the time assuming the null hypothesis is true (our
null hypothesis being that there is no differential expression between
the clinical subtypes in these genes).
If I continued this analysis and found that I had a lot of genes
remaining, I could lower the threshold, but 0.05 is a solid
starting point with that in mind.
edgeR’s analysis provides a FDR value,
which implies Benjamini-Hochberg, however I wanted to independently
verify.
corrected_top_hits <- p.adjust(top_hits$table$PValue,
method = "BH",
n = nrow(normalized_counts))
length(which(corrected_top_hits < 0.05))
## [1] 3
So, using this correction method, only 3 genes pass correction for significant differential expression.
As described, the False Discovery Rates provided by our analysis match my independent understanding of the Benjamini-Hochberg method.
This method is very applicable to our use-case since it is less stringent than Bonferroni when it comes to having a large number of hypotheses. It is best practice to use Benjamini-Hochberg first to see which of your genes pass correction, and if you need a more stringent corrective method, you can consider using Bonferroni. However, Bonferroni is said to be more useful when you are testing a smaller number of hypotheses which is generally not the case for RNA sequencing data.
For our two model designs, the following number of genes passed correction for each classifier.
| Significantly Differentially Expressed Genes | Significantly Differentially Expressed Genes after correction | |
|---|---|---|
| Aggregate across subtypes | 1462 | 3 |
| HER2 expression | 1240 | 0 |
| ER expression | 779 | 11 |
There is clearly a large reduction in number of genes significantly differentially expressed.
For the aggregate model of clinical subtypes, only 3
genes passed correction for false discovery. No genes
passed correction for HER2 expression association, whereas
11 genes passed correction for ER expression
association.
The genes and their associated values are shown in the tables below.
## logFC logCPM F PValue FDR
## CD207 2.742063 2.289659 28.65452 3.773769e-06 0.02946459
## MKX 3.847046 3.622901 31.37091 3.981701e-06 0.02946459
## ANXA8 3.552607 3.527024 29.51717 7.467909e-06 0.03684168
## logFC logCPM F PValue FDR
## EPHA3 3.105300 4.056737 59.47365 3.974726e-08 0.0005882595
## KCNMB2 2.661346 2.998343 42.09601 4.816566e-07 0.0035642587
## CLEC1B 2.473004 2.284272 33.75451 7.841777e-07 0.0038686101
## CYP2G1P 3.003659 2.742428 31.76628 1.579929e-06 0.0058457390
## CDH26 2.446980 3.017881 31.97038 5.149052e-06 0.0152411937
## CASC3 2.623309 8.002839 30.97451 7.846817e-06 0.0171345076
## TFF3 3.218604 2.873940 27.94817 8.104159e-06 0.0171345076
## RAPGEFL1 2.888305 4.761925 28.23030 1.585884e-05 0.0254678395
## MSL1 2.947257 8.590506 27.74632 1.706774e-05 0.0254678395
## MUC4 4.533581 5.796390 27.66638 1.720800e-05 0.0254678395
## THRA 1.998495 5.481885 25.63529 2.943228e-05 0.0395997889
An MA Plot demonstrates the log-fold change in contrast to the
log-concentration for our counts. I use edgeR’s MA plot
functionality for both model designs.
edgeR::maPlot(logAbundance=top_hits$table$logCPM,
logFC=top_hits$table$logFC,
de.tags=which(top_hits$table$FDR < 0.05),
lowess=TRUE,
xlab="log(counts per million)",
ylab="log(fold-change)",
main="Figure 4: MA Plot of significantly differentially
expressed genes after correction across
clinical subtypes, aggregate design")
legend("topright", title="Significance",
legend=c("not significant",
"significant after correction"),
col=c("black", "red"), pch=1, cex=0.8)
Figure 4: MA Plot of significantly differentially expressed genes after correction across clinical subtypes, aggregate design
edgeR::maPlot(logAbundance=top_erp$table$logCPM,
logFC=top_erp$table$logFC,
de.tags=which(top_erp$table$FDR < 0.05),
lowess=TRUE,
xlab="log(counts per million)",
ylab="log(fold-change)",
main="Figure 5: MA Plot of significantly differentially
expressed genes after correction
associated with ER expression")
legend("topright", title="Significance",
legend=c("not significant",
"significant after correction"),
col=c("black", "red"), pch=1, cex=0.8)
Figure 5: MA Plot of significantly differentially expressed genes after correction associated with ER expression
The volcano plot provides another visual for the spread of our differentially expressed genes and their significance, both before and after correction.
The green points are those with significant p-values before correction, with the red points being significant genes after correction.
plot(top_hits$table$logFC, -10*log10(top_hits$table$PValue),
main="Figure 6: Volcano Plot of significantly differentially
expressed genes after correction
across clinical subtypes, aggregate design",
xlab="log(fold change)",
ylab="-10^log(p-value)")
# highlight differentially expressed genes
points(top_hits$table$logFC[which(top_hits$table$PValue < 0.05)],
-10*log10(top_hits$table$PValue)[which(top_hits$table$PValue < 0.05)],
col="green")
points(top_hits$table$logFC[which(top_hits$table$FDR < 0.05)],
-10*log10(top_hits$table$PValue)[which(top_hits$table$FDR < 0.05)],
col="red")
legend("topleft", title="Significance",
legend=c("not significant",
"significant before correction",
"significant after correction"),
col=c("black", "green", "red"), pch=1, cex=0.8)
Figure 6: Volcano Plot of significantly differentially expressed genes after correction across clinical subtypes, aggregate design
plot(top_erp$table$logFC, -10*log10(top_erp$table$PValue),
main="Figure 7: Volcano Plot of significantly differentially
expressed genes after correction
associated with ER expression",
xlab="log(fold change)",
ylab="-10^log(p-value)")
# highlight differentially expressed genes
points(top_erp$table$logFC[which(top_erp$table$PValue < 0.05)],
-10*log10(top_erp$table$PValue)[which(top_erp$table$PValue < 0.05)],
col="green")
points(top_erp$table$logFC[which(top_erp$table$FDR < 0.05)],
-10*log10(top_erp$table$PValue)[which(top_erp$table$FDR < 0.05)],
col="red")
legend("topleft", title="Significance",
legend=c("not significant",
"significant before correction",
"significant after correction"),
col=c("black", "green", "red"), pch=1, cex=0.8)
Figure 7: Volcano Plot of significantly differentially expressed genes after correction associated with ER expression
This volcano plot of differentially expressed genes with regards to
ER- and ER+ is very interesting given its
strong skew towards up-regulation.
plot(top_her2p$table$logFC, -10*log10(top_her2p$table$PValue),
main="Figure 8: Volcano Plot of significantly differentially
expressed genes after correction
associated with HER2 expression",
xlab="log(fold change)",
ylab="-10^log(p-value)")
# highlight differentially expressed genes
points(top_her2p$table$logFC[which(top_her2p$table$PValue < 0.05)],
-10*log10(top_her2p$table$PValue)[which(top_her2p$table$PValue < 0.05)],
col="green")
points(top_her2p$table$logFC[which(top_her2p$table$FDR < 0.05)],
-10*log10(top_her2p$table$PValue)[which(top_her2p$table$FDR < 0.05)],
col="red")
legend("topleft", title="Significance",
legend=c("not significant",
"significant before correction",
"significant after correction"),
col=c("black", "green", "red"), pch=1, cex=0.8)
Figure 8: Volcano Plot of significantly differentially expressed genes after correction associated with HER2 expression
Both the aggregate model design, and the HER2 design are
fairly evenly distributed around 0.
Aggregating our top hits across clinical subtypes into a heatmap, we can see the expression visually as shown, along with a heatmap annotation corresponding to each binary pairing represented by the subtype classifier.
tops <- rownames(qlf.subtypes$table)[
qlf.subtypes$table$PValue < 0.05
]
top_hits_matrix = t(scale(
t(norm_matrix[which(rownames(norm_matrix) %in% tops),])
))
# colours
if (min(top_hits_matrix) == 0) {
heatmap_col = colorRamp2(c(0, max(top_hits_matrix)),
c("white", "red"))
} else {
heatmap_col = colorRamp2(c(min(top_hits_matrix), 0, max(top_hits_matrix)),
c("blue", "white", "red"))
}
# annotations
her2_col <- c("lightgreen", "red")
names(her2_col) <- c('TRUE', 'FALSE')
er_col <- c("darkgreen", "pink")
names(er_col) <- c('TRUE', 'FALSE')
types_df$her2 <- factor(types_df$her2, levels = c('TRUE','FALSE'))
types_df$er <- factor(types_df$er, levels = c('TRUE','FALSE'))
heatmap_annotation <- ComplexHeatmap::HeatmapAnnotation(
df = data.frame(her2 = types_df$her2,
er = types_df$er),
col = list(her2 = her2_col, er = er_col),
show_legend=TRUE
)
heatmap <- ComplexHeatmap::Heatmap(as.matrix(top_hits_matrix),
cluster_rows=TRUE,
cluster_columns=TRUE,
show_row_dend=TRUE,
show_column_dend=TRUE,
col=heatmap_col,
show_column_names=FALSE,
show_row_names=FALSE,
show_heatmap_legend=TRUE,
name="colour scale",
top_annotation=heatmap_annotation,
column_title="Figure 9: Heatmap of top hits across
clinical subtypes, aggregate design")
Figure 9: Heat map of top hits across clinical subtypes, aggregate design with binary labelling
Specifically for ER overexpression or lack thereof, I
added an additional heatmap.
tops_er <- rownames(qlf.erp$table)[
qlf.erp$table$PValue < 0.05
]
tops_er_matrix = t(scale(
t(norm_matrix[which(rownames(norm_matrix) %in% tops_er),])
))
# colours
if (min(tops_er_matrix) == 0) {
heatmap_col = colorRamp2(c(0, max(tops_er_matrix)),
c("white", "red"))
} else {
heatmap_col = colorRamp2(c(min(tops_er_matrix), 0, max(tops_er_matrix)),
c("blue", "white", "red"))
}
heatmap_er <- ComplexHeatmap::Heatmap(as.matrix(tops_er_matrix),
cluster_rows=TRUE,
cluster_columns=TRUE,
show_row_dend=TRUE,
show_column_dend=TRUE,
col=heatmap_col,
show_column_names=FALSE,
show_row_names=FALSE,
show_heatmap_legend=TRUE,
name="colour scale",
top_annotation=heatmap_annotation,
column_title="Figure 10: Heatmap of top hits
associated with ER overexpression")
Figure 10: Heatmap of top hits associated with ER overexpression
In figure 9, there is very visibly a cluster demonstrated between
heat intensity and ER overexpression or lack thereof with
the genes primarily in the top half being generally highly expressed in
lack of ER, and lowly expressed in overexpression of
ER. Similarly, the lower half of genes primarily follow the
opposite trend; lower expression on ER- and higher
expression on ER+.
In figure 10, it is more difficult to see the cluster patterns.
Sample 1 has very intense RNA expressions in the top half of genes, but
its matching HER2+/ER- subtype lacks this at all. This
implies the association goes in a broader scale than lack of
ER for these genes. Despite 11 genes being associated with
overexpression of ER, it does not appear to show a visible
cluster on the heatmap like the aggregate design in figure 9 does pretty
well.
With our significantly up-regulated, and down-regulated gene-sets, we will run a thresholded gene-set enrichment analysis.
We have 807 up-regulated genes, and 655 down-regulated genes for our aggregate model design across all clinical subtypes.
# create non-thresholded gene sets
nt_geneset <- qlf.subtypes$table
nt_geneset[,"rank"] <- -log10(nt_geneset$PValue) * sign(nt_geneset$logFC)
nt_geneset <- nt_geneset[order(nt_geneset$rank),]
# create thresholded up and down gene sets
upreg <- rownames(nt_geneset)[which(nt_geneset$PValue < 0.05
& nt_geneset$logFC > 0)]
downreg <- rownames(nt_geneset)[which(nt_geneset$PValue < 0.05
& nt_geneset$logFC < 0)]
# function for retrieving g:GOSt from g:Profiler
gost_res <- function(query) {
res <- gost(query = query,
significant=FALSE, # set our own threshold
ordered_query = TRUE, # ordered by rank
exclude_iea=TRUE, # exclude electronic GO annotations
correction_method = "fdr", # BH
organism = "hsapiens",
source = c("REAC","WP","GO:BP",
"GO:CC", "GO:MF", "KEGG"))
return(res)
}
I query our genesets using g:Profiler’s
g:GOSt.
# up and down regulated thresholded genes
all <- gost_res(unique(c(upreg, downreg)))
# only up regulated thresholded genes
up <- gost_res(upreg)
# only down regulated thresholded genes
down <- gost_res(downreg)
| p_value | term_id | term_name |
|---|---|---|
| 0.018483 | GO:0046032 | ADP catabolic process |
| 0.018483 | GO:0046031 | ADP metabolic process |
| 0.018483 | GO:0006096 | glycolytic process |
| 0.018483 | GO:0009057 | macromolecule catabolic process |
| 0.018483 | GO:0009137 | purine nucleoside diphosphate catabolic process |
| p_value | term_id | term_name |
|---|---|---|
| 0.0151737 | GO:0022613 | ribonucleoprotein complex biogenesis |
| 0.0151737 | GO:0009179 | purine ribonucleoside diphosphate metabolic process |
| 0.0151737 | GO:0009181 | purine ribonucleoside diphosphate catabolic process |
| 0.0151737 | GO:0009056 | catabolic process |
| 0.0151737 | GO:0046032 | ADP catabolic process |
| p_value | term_id | term_name |
|---|---|---|
| 0.0666454 | GO:0031000 | response to caffeine |
| 0.0666454 | GO:0071871 | response to epinephrine |
| 0.0666454 | GO:0071872 | cellular response to epinephrine stimulus |
| 0.0666454 | GO:0032989 | cellular anatomical entity morphogenesis |
| 0.0666454 | GO:0055006 | cardiac cell development |
I chose to use g:Profiler with R since it was well
documented by Ruth Isserlin [3] and I had familiarity with it
through my journal entry.
g:Profiler provides an ORA
(over-representation analysis) method, and this is exactly the method I
want to perform on my thresholded data.
I chose the following annotation sources to get the most comprehensive results for our functional enrichment analysis relative to biological processes and gene ontology:
GO:BP :
Gene Ontology: Biological ProcessesREAC : ReactomeWP : WikiPathwaysGO:CC :
Gene Ontology: Cellular ComponentsGO:MF :
Gene Ontology: Molecular FunctionKEGG :
Kyoto Encyclopedia of Genes and GenomesGO:BP, REAC, and WP were
recommended by Professor Isserlin, however I ran the ORA using these and
found nothing jumped out from the resulting plots. As such, I revisited
to expand the scope of annotation sources in hopes for a more
informative result.
For g:Profiler version information, I used:
g:Profiler version:
e112_eg59_p19_25aa4782biomaRt version: 112For the annotation sources selected in my query, I used:
GO:BP version annotations: BioMart classes:
releases/2024-10-27REAC version annotations: BioMart classes:
2025-2-3WP version 20250110GO:CC version annotations: BioMart classes:
releases/2024-10-27GO:MF version annotations: BioMart classes:
releases/2024-10-27KEGG version KEGG FTP Release
2024-01-22| Max 100 | Max 250 | Max 1000 | Max 100000 | |
|---|---|---|---|---|
| Up & Down | 991 | 1646 | 2160 | 2380 |
| Up | 558 | 1123 | 1610 | 1829 |
| Down | 126 | 324 | 570 | 643 |
| Max 100 | Max 250 | Max 1000 | Max 100000 | |
|---|---|---|---|---|
| Up & Down | 119 | 514 | 942 | 1132 |
| Up | 49 | 314 | 698 | 886 |
| Down | 6 | 49 | 152 | 193 |
After thresholding with maximum value of 100 and minimum intersection of 10, we can see the terms remain similar.
| term_id | term_name | p_value | significant | |
|---|---|---|---|---|
| 1 | GO:0046032 | ADP catabolic process | 0.018483 | TRUE |
| 2 | GO:0046031 | ADP metabolic process | 0.018483 | TRUE |
| 3 | GO:0006096 | glycolytic process | 0.018483 | TRUE |
| 5 | GO:0009137 | purine nucleoside diphosphate catabolic process | 0.018483 | TRUE |
| 6 | GO:0019364 | pyridine nucleotide catabolic process | 0.018483 | TRUE |
| term_id | term_name | p_value | significant | |
|---|---|---|---|---|
| 2 | GO:0009179 | purine ribonucleoside diphosphate metabolic process | 0.0151737 | TRUE |
| 3 | GO:0009181 | purine ribonucleoside diphosphate catabolic process | 0.0151737 | TRUE |
| 5 | GO:0046032 | ADP catabolic process | 0.0151737 | TRUE |
| 7 | GO:0046031 | ADP metabolic process | 0.0151737 | TRUE |
| 8 | GO:0006096 | glycolytic process | 0.0151737 | TRUE |
| term_id | term_name | p_value | significant | |
|---|---|---|---|---|
| 4 | GO:0032989 | cellular anatomical entity morphogenesis | 0.0666454 | FALSE |
| 7 | GO:0010927 | cellular component assembly involved in morphogenesis | 0.0666454 | FALSE |
| 48 | GO:0035051 | cardiocyte differentiation | 0.0676004 | FALSE |
| 135 | GO:0010469 | regulation of signaling receptor activity | 0.0937527 | FALSE |
| 4333 | GO:0005796 | Golgi lumen | 0.0532868 | FALSE |
As such, we find that up-regulated genes have some significant gene-sets after thresholding, whereas our down-regulated genes fail to maintain significance.
Our combination table shows us strong overlap with the up-regulated genes, further supporting the significance of our up-regulation skewed results.
The following figures demonstrate the over-representation analysis plots for both up & down regulated, up-regulated, and down-regulated genes respectively with only a couple of points manually highlighted.
Since the up-regulated gene results tend to exceed the down-regulated gene results in terms of term significance, the combined plot appears to more closely resemble the up-regulated plot.
In our case, ORA of up and down-regulated genes in combination did not have a large effect on the results, but this is because the up-regulated genes overshadow the down-regulated. If this were a different case, we could see the difference between the up- and down-regulated plots overshadowed and melted by the combination plot. As such, it is important to look at both up- and down-regulated separately.
Although, in terms of our results of this analysis, although some terms are significant according to our arbitrary threshold, none of them are strikingly so with maximum -log(adjusted p-values) only reaching up to about 3 maximum, whereas plots seen with stronger functional enrichment presence reach up much higher towards 10-16+.
The original paper includes substantially more processing of this
data in addition to the single-cell RNA sequencing data they acquired in
tandem, so it is difficult to say whether this supports those results.
None of my significant differentially expressed genes appear throughout
the report whatsoever. The discussion and results section for the bulk
RNA-sequencing is very lackluster compared to the rest of the analysis
and results found by spatial resolution of single-cell RNA sequencing.
The RNA sequencing heatmap appears similar to mine, although they used
scaled mean interaction score and subtyped only for ER and
TNBC as opposed to the four available clinical subtypes. I
can understand this approach since the HER2 subtype
presence was relatively small compared to the rest of the samples.
There is not much functional enrichment apparently distinguishing the classified clinical subtypes of breast cancer in this dataset; although if we had to look at the most significant term(s), our analysis would draw our attention to cadherin binding, ie. GO:0045296, of which is not present in the paper’s conclusions. As such, my findings do not support the mechanisms or conclusions discussed in the original paper, although that does not deny my findings.
Continuing with the most significant GO term from our analysis–cadherin binding–there is substantial evidence that cadherins are integral to mammary gland function, and misexpression of such leads to increased motility and consequently greater likelihood of metastases formation [4]. E-cadherin is one of the most widely expressed tumour suppressors in breast cancer, however cadherin-binding on its own is rather vague and difficult to draw conclusions from. We can conclude, however, that cadherins are associated with breast cancer in multiple ways, and so cadherin-binding may be a distinguishable functional enrichment between clinical subtypes–but this claim is too general to confidently back.
Interestingly, one of our three significantly differentially
expressed genes after correction across clinical subtypes of breast
cancer is ANXA8 which is associated with breast cancer
[5]. Rosetti et
al. suggested ANXA8 is significantly higher in
ER- subtypes as opposed to ER+ subtypes [6]
but instead of finding that in our results, we noticed
ANXA8 was significantly differentially expressed across all
clinical subtypes classified in our study and not specifically
ER overexpression or lack thereof. This warrants
investigation into why ANXA8 is expressed differentially in
ER subtypes according to research, yet selectively
expressed differentially across all four of our classified clinical
subtypes of breast cancer, and not specifically for ER
in-silico.
In terms of significantly differentially expressed genes in regards
to ER- and ER+ specifically, the most
significant gene is EPHA3 which has been shown to be
extensively more up-regulated in ER+ breast cancers [7],
which directly supports my findings. Although my findings were not
reflective of the mechanisms and direction of the paper I retrieved the
data from, they are supported by other evidence in certain ways.
In conclusion, there is some associated evidence that suggests my
results have a foundation in functional biology, however nothing has a
significance that would give me confidence to back my claim. For this, I
would want a p-value smaller than 10e-3.
This paper makes use of packages knitr [8],
BiocManager [9], GEOquery [10],
kableExtra [11], edgeR [12],
limma [13], ComplexHeatmap [14],
circlize [15], gprofiler2 [16],
GSA [17], & ggplot2 [18].